Introduction to R and Bioconductor - hands-on session
@FedeBioinfoThis lecture is inspired in its structure and organisation by the “Introduction to data analysis with R and Bioconductor” - https://carpentries-incubator.github.io/bioc-intro/
Available at www.r-project.org
What can you do with R?
Why should I use R?
Why should I not use R?
Get R - and RStudio
Alternatives:
Open up RStudio - You’ll have four panes
Want to customize this?
Tools -> Global Options -> Pane Layout
Advantages of an IDE
It is good practice to keep a set of related data, analyses, and text self-contained in a single folder, called the working directory.
Why?
How?
RStudio projects!
Custom settings, per project.
Let’s create one live now - and have the workspace NOT saved
What is the best structure?
One, used consistently - not gonna touch on naming things as it can get hot quickly :)
With R…
dir.create()
file.edit()
Where am I doing things?
The working directory is the place from where R will be looking for and saving the files.
getwd()/setwd(), but not in your scripts (fails on others’ computers!)
Instructions, commands.
Scripts, console - use the editor and have a complete record on what you did!
Shortcuts FTW!
Even better: Reproducible documents, with R Markdown
Nice resources on top: * RStudio cheatsheet about the RStudio IDE! * the internet/rstats community!
The main point: describe well your problem, “help others help you”
Others need to reproduce your error to help you better: saveRDS(), dput(), sessionInfo()
R packages…
Repositories:
My contributions so far:
https://bioconductor.org/packages/flowcatchR/https://bioconductor.org/packages/pcaExplorer/https://bioconductor.org/packages/ideal/https://bioconductor.org/packages/iSEE)https://bioconductor.org/packages/GeneTonic)For Bioconductor packages…
install.packages("BiocManager")
library("BiocManager")
BiocManager::install()
Relevant commands:
install.packages("packagename") - check it online at CRAN!installed.packages().libPaths()update.packages()library("packagename")help(package="packagename"), data(), browseVignettes(), vignette(), citation("packagename")Something you might have already done:
BiocManager::install("SummarizedExperiment")
BiocManager::install("DESeq2")
Here we will touch on the first commands in R, so that you can
… but not just that.
Type the following
2 + 2
# [1] 4
log(2)
# [1] 0.6931472
347 * 73841
# [1] 25622827
7/2
# [1] 3.5
7%/%2
# [1] 3
7%%2
# [1] 1
# this calls the help for a function to plot a histogram
?hist
# this is just the same
help(hist) ## what about ??
?apropos
apropos("row")
# [1] ".row" ".rowMeans"
# [3] ".rowNamesDF<-" ".rowSums"
# [5] ".rs.api.tutorialLaunchBrowser" ".rs.explorer.defaultRowLimit"
# [7] ".rs.formatRowNames" ".rs.isBrowserActive"
# [9] ".rs.nrow" ".rs.refreshShinyLaunchBrowserOption"
# [11] ".rs.tutorial.launchBrowser" "add_row"
# [13] "add_row" "add_rownames"
# [15] "arrow" "arrows"
# [17] "as_tibble_row" "auto_browse"
# [19] "bind_rows" "bindROWS"
# [21] "bindROWS" "bindROWS"
# [23] "browseEnv" "browseKEGG"
# [25] "browser" "browserCondition"
# [27] "browserSetDebug" "browserText"
# [29] "browseUCSCtrack" "browseURL"
# [31] "browseVignettes" "colAvgsPerRowSet"
# [33] "colAvgsPerRowSet" "column_to_rownames"
# [35] "combineRows" "cur_group_rows"
# [37] "db_query_rows" "dplyr_row_slice"
# [39] "elementNROWS" "elementNROWS"
# [41] "extractROWS" "extractROWS"
# [43] "group_rows" "has_rownames"
# [45] "indexByRow" "makeClassinfoRowForCompactPrinting"
# [47] "mergeROWS" "n2mfrow"
# [49] "narrow" "narrow"
# [51] "narrow" "new_rowwise_df"
# [53] "nrow" "nrow"
# [55] "nrow" "nrow"
# [57] "nrow" "nrow"
# [59] "nrow" "NROW"
# [61] "NROW" "NROW"
# [63] "nrows" "nrows"
# [65] "panel_rows" "PlantGrowth"
# [67] "remove_rownames" "replaceROWS"
# [69] "replaceROWS" "row"
# [71] "row_number" "row.names"
# [73] "row.names.data.frame" "row.names.default"
# [75] "row.names<-" "row.names<-.data.frame"
# [77] "row.names<-.default" "rowAlls"
# [79] "rowAlls" "rowAnyMissings"
# [81] "rowAnyNAs" "rowAnyNAs"
# [83] "rowAnys" "rowAnys"
# [85] "rowAvgsPerColSet" "rowAvgsPerColSet"
# [87] "rowCollapse" "rowCollapse"
# [89] "rowCounts" "rowCounts"
# [91] "rowCummaxs" "rowCummaxs"
# [93] "rowCummins" "rowCummins"
# [95] "rowCumprods" "rowCumprods"
# [97] "rowCumsums" "rowCumsums"
# [99] "rowData" "rowData"
# [101] "rowData<-" "rowDataColorMap"
# [103] "rowDataColorMap<-" "RowDataPlot"
# [105] "RowDataTable" "rowDiffs"
# [107] "rowDiffs" "rowid_to_column"
# [109] "rowIQRDiffs" "rowIQRDiffs"
# [111] "rowIQRs" "rowIQRs"
# [113] "rowLogSumExps" "rowLogSumExps"
# [115] "rowMadDiffs" "rowMadDiffs"
# [117] "rowMads" "rowMads"
# [119] "rowMax" "rowMaxs"
# [121] "rowMaxs" "rowMeans"
# [123] "rowMeans" "rowMeans2"
# [125] "rowMeans2" "rowMedians"
# [127] "rowMedians" "rowMedians"
# [129] "rowMin" "rowMins"
# [131] "rowMins" "rownames"
# [133] "rownames" "rownames"
# [135] "rownames" "rownames"
# [137] "ROWNAMES" "ROWNAMES"
# [139] "rownames_to_column" "rownames<-"
# [141] "rownames<-" "rownames<-"
# [143] "rownames<-" "ROWNAMES<-"
# [145] "ROWNAMES<-" "rowOrderStats"
# [147] "rowOrderStats" "rowPair"
# [149] "rowPair<-" "rowPairNames"
# [151] "rowPairNames<-" "rowPairs"
# [153] "rowPairs<-" "rowProds"
# [155] "rowProds" "rowQ"
# [157] "rowQuantiles" "rowQuantiles"
# [159] "rowRanges" "rowRanges"
# [161] "rowRanges" "rowRanges<-"
# [163] "rowRanks" "rowRanks"
# [165] "rows_append" "rows_delete"
# [167] "rows_insert" "rows_patch"
# [169] "rows_update" "rows_upsert"
# [171] "rowSdDiffs" "rowSdDiffs"
# [173] "rowSds" "rowSds"
# [175] "rowSelectionColorMap" "rowSubset"
# [177] "rowSubset<-" "rowsum"
# [179] "rowsum.data.frame" "rowsum.default"
# [181] "rowsum.DGEList" "rowsum.SummarizedExperiment"
# [183] "rowSums" "rowSums"
# [185] "rowSums2" "rowSums2"
# [187] "rowTabulates" "rowTabulates"
# [189] "rowVarDiffs" "rowVarDiffs"
# [191] "rowVars" "rowVars"
# [193] "rowWeightedMads" "rowWeightedMads"
# [195] "rowWeightedMeans" "rowWeightedMeans"
# [197] "rowWeightedMedians" "rowWeightedMedians"
# [199] "rowWeightedSds" "rowWeightedSds"
# [201] "rowWeightedVars" "rowWeightedVars"
# [203] "rowwise" "sameAsPreviousROW"
# [205] "separate_rows" "separate_rows_"
# [207] "tibble_row" "ToothGrowth"
# [209] "validate_rowwise_df" "xpdrows.data.frame"
#rstats)getwd() and setwd() - Tab is your friend<-, =: the assignment operatorls(), rm()str()example(), help()/?[function]print()q()/quit()TRUE,FALSE,!,==,!=,<,>,<=,>=,|,&,xor()c()carsFind out what these do!
?getwd
?setwd
?`<-`
help(ls)
help(rm)
?str
?example
help(help)
?print
help(quit)
?c
?cars
# This is a comment
Careful here:
Grab some mini-postit!
iris dataset. What is it about at all? How many variables are included? How many observations?replicate! find out a function that replicates elements of a vector to produce this1 1 1 1
BONUS: … and this
1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
rep(1,4)
# [1] 1 1 1 1
rep(c(1,2,3,4,5),3)
# [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
R can recognize different general types of data
numbers (numeric)
character strings (text)
logical (e.g. class(TRUE))
factors (“integers with a set of labels”) - it is categorical data!
special ones: dates, time, …
When and where to use which?
In a previous edition of a similar course, I wanted to know:
Our aims:
R Markdown allows you to create documents that serve as a neat record of your analysis.
Why?
The key point is…
R Markdown documents present your code alongside its output (graphs, tables, etc.) with conventional text to explain it, a bit like a notebook. To do this, R Markdown uses markdown syntax.
Markdown is a very simple markup language which provides methods for creating documents with headers, images, links etc. from plain text files, while keeping the original plain text file easy to read.
You can convert Markdown documents to many other file types like .html or .pdf to display the headers, images etc..
It might sound complicated. But really isn’t,
First things first: install the required software
install.packages("rmarkdown")
library(rmarkdown)
knitr comes along, pandoc too. You should quickly be all set!ABC here, let’s go through it:
http://rmarkdown.rstudio.com/authoring_basics.html
plus… a beautiful cheat sheet is there for you!
http://rmarkdown.rstudio.com/lesson-15.html
http://www.rstudio.com/wp-content/uploads/2016/03/rmarkdown-cheatsheet-2.0.pdf
Using LaTeX? No problem, you can use \(\LaTeX\) here as well!
\(\left( f(x) = \sum_{i=0}^{n} \frac{a_i}{1+x} \right)\)
What can you do with Rmarkdown?
The price to pay to have an Rmd document is sooooo small - and for that, you get
To create a new R Markdown file (.Rmd), select File -> New File -> R Markdown... in RStudio, then choose the file type you want to create.
The newly created .Rmd file comes with basic instructions but we want to create our own R Markdown script, so let’s get to know the different parts of an Rmd file
---sUh, you can insert figures also like this

```r
n <- 10
rnorm(n)
# [1] 0.2750427 1.1800167 1.1070905 0.1732659 -0.2742060 1.5992610 -2.0094319 -0.7177109
# [9] 0.4198595 -1.1974104
```
Shortcut: Ctrl + Alt + I
Input code: you can use multiple languages including R, Python, and SQL, many more (specify the language in the chunk options)
Inline code can be added with `r 1+1`
Deatiled very nicely here: https://yihui.name/knitr/options/
A simple set of options which you can use for many documents:
set.seed(42)
knitr::opts_chunk$set(
comment = NA,
fig.align = "center",
fig.width = 7,
fig.height = 7,
warning = FALSE,
eval = TRUE
)
Use the Knit button in the RStudio IDE to render the file and preview the output with a single click or keyboard shortcut (Ctrl + Shift + K).
To generate a report from the file, run the render command (works also outside of RStudio):
library("rmarkdown")
rmarkdown::render("yourfile.Rmd")
It was a deep dive, but now…
You can do much much more (presentations, websites, manuscripts,…)
File -> New File -> R Markdown... in RStudiooutput:
word_document
80-20? 90-10? Import, clean, prepare, transform your data
Sources:
havenreadxl - highly recommended!… and exporting
read.table(),write.table() + read.csv|delimstringsAsFactors=FALSEload(),save()/readRDS(),saveRDS()haven : read_sas(),read_spss() /write_sas(),write_sav()readxl: read_excel()Check out their documentation pages!
Other options: rio, RStudio GUI
Go to https://github.com/federicomarini/rbioc2016
-> inst/extdata
-> survey_responses.csv, in its raw format
You can load it directly like this
surveyrbioc <- read.csv("https://raw.githubusercontent.com/federicomarini/rbioc2016/master/inst/extdata/survey_responses.csv")
Or install the package and load it from there
library("devtools")
install_github("federicomarini/rbioc2016")
library("rbioc2016")
data(surveyrbioc)
Sometimes your data is either small and/or not in an Excel-like tabular format.
What to do? You combine the elements together!
Q1 <- c(28,27,33,32,29)
# should return this
Q1
[1] 28 27 33 32 29
Q2 <- c("PhD student","PhD student", "Postdoc","PhD student","PhD student")
Q2
[1] "PhD student" "PhD student" "Postdoc" "PhD student" "PhD student"
# ... and so on
We have seen c(). We also have
cbindrbindfirstTwo <- cbind(Q1,Q2)
firstTwo
Q1 Q2
[1,] "28" "PhD student"
[2,] "27" "PhD student"
[3,] "33" "Postdoc"
[4,] "32" "PhD student"
[5,] "29" "PhD student"
rbind(Q1,Q2)
[,1] [,2] [,3] [,4] [,5]
Q1 "28" "27" "33" "32" "29"
Q2 "PhD student" "PhD student" "Postdoc" "PhD student" "PhD student"
Is this what you wanted?
But first, what can you do on these objects?
sum(Q1)
[1] 149
sum(Q2)
Error in sum(Q2): invalid 'type' (character) of argument
summary(Q1)
Min. 1st Qu. Median Mean 3rd Qu. Max.
27.0 28.0 29.0 29.8 32.0 33.0
summary(Q2)
Length Class Mode
5 character character
str(Q1)
num [1:5] 28 27 33 32 29
str(Q2)
chr [1:5] "PhD student" "PhD student" "Postdoc" "PhD student" "PhD student"
mean(Q1)
[1] 29.8
dim(firstTwo)
[1] 5 2
firstTwo[,1]
[1] "28" "27" "33" "32" "29"
mean(firstTwo[,1]) # Why, damn, why? Meet coercion
[1] NA
class(firstTwo)
[1] "matrix" "array"
matrix, data.frame and listmatrix can contain one type of data - if numeric, you unleash all the matrix algebra power!data.frame can store more types of data (one per column)list is like a big box where you can put anything - but this is not always what you wantWhat is best?
Let’s try with a list
Q3 <- c("intermediate","poor","good","none","intermediate")
mylist <- list(Q1,Q2,Q3)
mylist
[[1]]
[1] 28 27 33 32 29
[[2]]
[1] "PhD student" "PhD student" "Postdoc" "PhD student" "PhD student"
[[3]]
[1] "intermediate" "poor" "good" "none" "intermediate"
## access your elements with
mylist[[1]]
[1] 28 27 33 32 29
mylist[[1]][2]
[1] 27
How do we create a data.frame?
mydf <- data.frame(age = Q1,
level = Q2,
rexp = Q3)
mydf
age level rexp
1 28 PhD student intermediate
2 27 PhD student poor
3 33 Postdoc good
4 32 PhD student none
5 29 PhD student intermediate
class(mydf$age)
[1] "numeric"
data.framemydf$age # it's all about the money :)
[1] 28 27 33 32 29
mydf[,1]
[1] 28 27 33 32 29
names(mydf)
[1] "age" "level" "rexp"
rownames(mydf)
[1] "1" "2" "3" "4" "5"
dim(mydf)
[1] 5 3
nrow(mydf)
[1] 5
ncol(mydf)
[1] 3
surveyrbioc <- read.csv("https://raw.githubusercontent.com/federicomarini/rbioc2016/master/inst/extdata/survey_responses.csv")
head(surveyrbioc)
Q1 Q2 Q3 Q4 Q5 Q6
1 28 PhD student intermediate intermediate good I am a regular user
2 27 PhD student poor intermediate poor I know it exists
3 33 Postdoc good good good I know it exists
4 32 PhD student poor poor poor Is this supposed to be in the cloud?!
5 29 PhD student none poor poor I know it exists
6 33 Postdoc intermediate intermediate intermediate Once in a while i used it
Q7
1 Learn Parallelization with R (and Bioconductor)
2 <NA>
3 use R scripts in parallel context (ex: alignments in RNA-seq)
4 Id like to gather practise to analyse count data and develope the necessary data design. Additionally, it would be interesting to get to now how to find putative novel miRNAs.
5 Possible I will use R for editing RNA-seq data
6 I would describe myself as an advanced beginner, I am actively using R now to look (mostly) at count tables from NGS data and make plots. I am especially interested in the Bioconductor and parallelization in R section, this is new for me.
tail(surveyrbioc)
Q1 Q2 Q3 Q4 Q5
17 28 Master student/else intermediate intermediate intermediate
18 39 Postdoc good intermediate good
19 32 Master student/else none poor genoWhat?
20 29 PhD student poor none poor
21 31 PhD student none none good
22 30 PhD student none none good
Q6 Q7
17 I know it exists better understanding of R and to extend my knowledge
18 I am a regular user Mainly interested in parallel computing options
19 Is this supposed to be in the cloud?! <NA>
20 Is this supposed to be in the cloud?! To better understand R and perform basic analysis
21 I heard we had some servers around working with fastq files based on R
22 Is this supposed to be in the cloud?! I want to be able to analyze my sequence data on my own :P
names(surveyrbioc)
[1] "Q1" "Q2" "Q3" "Q4" "Q5" "Q6" "Q7"
# View(surveyrbioc)
str(surveyrbioc)
'data.frame': 22 obs. of 7 variables:
$ Q1: int 28 27 33 32 29 33 40 23 27 23 ...
$ Q2: chr "PhD student" "PhD student" "Postdoc" "PhD student" ...
$ Q3: chr "intermediate" "poor" "good" "poor" ...
$ Q4: chr "intermediate" "intermediate" "good" "poor" ...
$ Q5: chr "good" "poor" "good" "poor" ...
$ Q6: chr "I am a regular user" "I know it exists" "I know it exists" "Is this supposed to be in the cloud?!" ...
$ Q7: chr "Learn Parallelization with R (and Bioconductor)" NA "use R scripts in parallel context (ex: alignments in RNA-seq)" "Id like to gather practise to analyse count data and develope the necessary data design. Additionally, it would"| __truncated__ ...
summary(surveyrbioc)
Q1 Q2 Q3 Q4 Q5
Min. :23.00 Length:22 Length:22 Length:22 Length:22
1st Qu.:27.25 Class :character Class :character Class :character Class :character
Median :29.50 Mode :character Mode :character Mode :character Mode :character
Mean :30.14
3rd Qu.:32.75
Max. :40.00
Q6 Q7
Length:22 Length:22
Class :character Class :character
Mode :character Mode :character
surveyrbioc[, ]
Q1 Q2 Q3 Q4 Q5
1 28 PhD student intermediate intermediate good
2 27 PhD student poor intermediate poor
3 33 Postdoc good good good
4 32 PhD student poor poor poor
5 29 PhD student none poor poor
6 33 Postdoc intermediate intermediate intermediate
7 40 Postdoc good good intermediate
8 23 Master student/else poor poor good
9 27 PhD student none poor intermediate
10 23 Master student/else poor poor poor
11 35 Postdoc poor poor intermediate
12 34 Master student/else poor pro poor
13 31 Postdoc none none intermediate
14 27 PhD student none poor poor
15 24 Master student/else poor intermediate intermediate
16 28 PhD student none poor poor
17 28 Master student/else intermediate intermediate intermediate
18 39 Postdoc good intermediate good
19 32 Master student/else none poor genoWhat?
20 29 PhD student poor none poor
21 31 PhD student none none good
22 30 PhD student none none good
Q6
1 I am a regular user
2 I know it exists
3 I know it exists
4 Is this supposed to be in the cloud?!
5 I know it exists
6 Once in a while i used it
7 I am a regular user
8 I know it exists
9 I know it exists
10 Is this supposed to be in the cloud?!
11 I know it exists
12 I know it exists
13 Is this supposed to be in the cloud?!
14 I know it exists
15 I know it exists
16 I know it exists
17 I know it exists
18 I am a regular user
19 Is this supposed to be in the cloud?!
20 Is this supposed to be in the cloud?!
21 I heard we had some servers around
22 Is this supposed to be in the cloud?!
Q7
1 Learn Parallelization with R (and Bioconductor)
2 <NA>
3 use R scripts in parallel context (ex: alignments in RNA-seq)
4 Id like to gather practise to analyse count data and develope the necessary data design. Additionally, it would be interesting to get to now how to find putative novel miRNAs.
5 Possible I will use R for editing RNA-seq data
6 I would describe myself as an advanced beginner, I am actively using R now to look (mostly) at count tables from NGS data and make plots. I am especially interested in the Bioconductor and parallelization in R section, this is new for me.
7 <NA>
8 Get to know R better, basic commands
9 To look if I can process my sequencing data on my own using R
10 I would like to learn more about R, especially how to use it in biomedial research. Im in the 2nd Semester of the Rasterprogramm biomedicine, last Semester I had two weeks bioinformatics and we used to work with R for statistics/ChIP-Seq/microarray-data analysis, but the time was too short, to go deep into it, we just scratched the surface. So now I wish to learn some more, would be great, if I could work with the programme by myself, for example for the masterthesis.
11 learn more about R
12 Brush up on some R knowledge and maybe get some different perspective on Genome processing
13 to get a good and understandable introduction into R programming and bioinformatics
14 learning how to work with R
15 To learn the basics of R programming
16 <NA>
17 better understanding of R and to extend my knowledge
18 Mainly interested in parallel computing options
19 <NA>
20 To better understand R and perform basic analysis
21 working with fastq files based on R
22 I want to be able to analyze my sequence data on my own :P
Using the surveyrbioc object:
max can be your help)transpose the survey data and assign it to another variablemean(surveyrbioc$Q1)
[1] 30.13636
max(surveyrbioc$Q1)
[1] 40
my_transposed_survey <- t(surveyrbioc)
surveyrbioc_mod <- surveyrbioc
colnames(surveyrbioc_mod) <- c("age","level","rlevel","prog_level","genomics_level","parcomp_level","expectation")
surveyrbioc_mod$expectation[which.min(surveyrbioc_mod$age)]
[1] "Get to know R better, basic commands"
Describe, explore, transform, summarise data
dim(x) shows the dimensions of an objectstr(x) provides an overview of the structure of an object and the elements it containssum(x), mean(x), sd(x) computes the sum, mean, or standard deviation of all the elements in x; median(x), quantile(x)length(x) returns the number of elements in x (a vector)sqrt(x), log(x) take the square root and the natural logarithm of a numeric - element or vectorhist(x, breaks=20, col="blue") plots a histogram of variable x with 20 bins colored blueunique(x) returns the vector of unique elements in xrm(x) removes the object x from the environment (rm(list=ls()) removes all objects)sessionInfo() prints information about R session and versions of all attached packagesThis is the basic way it works
surveyrbioc[ROWS,COLUMNS]
You can subset with…
Try to make a guess, given this vector.
vec <- c(6, 1, 3, 6, 10, 5)
What happens if you do this?
vec[2]
[1] 1
vec[c(5, 6)]
[1] 10 5
vec[-c(5,6)]
[1] 6 1 3 6
vec > 5
[1] TRUE FALSE FALSE TRUE TRUE FALSE
vec[vec > 5]
[1] 6 6 10
What happens if you do this?
df <- data.frame(
name = c("John", "Paul", "George", "Ringo"),
birth = c(1940, 1942, 1943, 1940),
instrument = c("guitar", "bass", "guitar", "drums")
)
df
name birth instrument
1 John 1940 guitar
2 Paul 1942 bass
3 George 1943 guitar
4 Ringo 1940 drums
df[c(2, 4), 3]
[1] "bass" "drums"
df[ , 1]
[1] "John" "Paul" "George" "Ringo"
df[ , "instrument"]
[1] "guitar" "bass" "guitar" "drums"
df$instrument
[1] "guitar" "bass" "guitar" "drums"
Back to the survey
# I just want the age
surveyrbioc[,1]
[1] 28 27 33 32 29 33 40 23 27 23 35 34 31 27 24 28 28 39 32 29 31 30
# or
surveyrbioc$Q1
[1] 28 27 33 32 29 33 40 23 27 23 35 34 31 27 24 28 28 39 32 29 31 30
# the first 4 columns
surveyrbioc[,c(1,2,3,4)]
Q1 Q2 Q3 Q4
1 28 PhD student intermediate intermediate
2 27 PhD student poor intermediate
3 33 Postdoc good good
4 32 PhD student poor poor
5 29 PhD student none poor
6 33 Postdoc intermediate intermediate
7 40 Postdoc good good
8 23 Master student/else poor poor
9 27 PhD student none poor
10 23 Master student/else poor poor
11 35 Postdoc poor poor
12 34 Master student/else poor pro
13 31 Postdoc none none
14 27 PhD student none poor
15 24 Master student/else poor intermediate
16 28 PhD student none poor
17 28 Master student/else intermediate intermediate
18 39 Postdoc good intermediate
19 32 Master student/else none poor
20 29 PhD student poor none
21 31 PhD student none none
22 30 PhD student none none
surveyrbioc[,1:4]
Q1 Q2 Q3 Q4
1 28 PhD student intermediate intermediate
2 27 PhD student poor intermediate
3 33 Postdoc good good
4 32 PhD student poor poor
5 29 PhD student none poor
6 33 Postdoc intermediate intermediate
7 40 Postdoc good good
8 23 Master student/else poor poor
9 27 PhD student none poor
10 23 Master student/else poor poor
11 35 Postdoc poor poor
12 34 Master student/else poor pro
13 31 Postdoc none none
14 27 PhD student none poor
15 24 Master student/else poor intermediate
16 28 PhD student none poor
17 28 Master student/else intermediate intermediate
18 39 Postdoc good intermediate
19 32 Master student/else none poor
20 29 PhD student poor none
21 31 PhD student none none
22 30 PhD student none none
# all but the last column
surveyrbioc[,-7]
Q1 Q2 Q3 Q4 Q5
1 28 PhD student intermediate intermediate good
2 27 PhD student poor intermediate poor
3 33 Postdoc good good good
4 32 PhD student poor poor poor
5 29 PhD student none poor poor
6 33 Postdoc intermediate intermediate intermediate
7 40 Postdoc good good intermediate
8 23 Master student/else poor poor good
9 27 PhD student none poor intermediate
10 23 Master student/else poor poor poor
11 35 Postdoc poor poor intermediate
12 34 Master student/else poor pro poor
13 31 Postdoc none none intermediate
14 27 PhD student none poor poor
15 24 Master student/else poor intermediate intermediate
16 28 PhD student none poor poor
17 28 Master student/else intermediate intermediate intermediate
18 39 Postdoc good intermediate good
19 32 Master student/else none poor genoWhat?
20 29 PhD student poor none poor
21 31 PhD student none none good
22 30 PhD student none none good
Q6
1 I am a regular user
2 I know it exists
3 I know it exists
4 Is this supposed to be in the cloud?!
5 I know it exists
6 Once in a while i used it
7 I am a regular user
8 I know it exists
9 I know it exists
10 Is this supposed to be in the cloud?!
11 I know it exists
12 I know it exists
13 Is this supposed to be in the cloud?!
14 I know it exists
15 I know it exists
16 I know it exists
17 I know it exists
18 I am a regular user
19 Is this supposed to be in the cloud?!
20 Is this supposed to be in the cloud?!
21 I heard we had some servers around
22 Is this supposed to be in the cloud?!
# if you don't know we had 7 columns...
surveyrbioc[,-ncol(surveyrbioc)]
Q1 Q2 Q3 Q4 Q5
1 28 PhD student intermediate intermediate good
2 27 PhD student poor intermediate poor
3 33 Postdoc good good good
4 32 PhD student poor poor poor
5 29 PhD student none poor poor
6 33 Postdoc intermediate intermediate intermediate
7 40 Postdoc good good intermediate
8 23 Master student/else poor poor good
9 27 PhD student none poor intermediate
10 23 Master student/else poor poor poor
11 35 Postdoc poor poor intermediate
12 34 Master student/else poor pro poor
13 31 Postdoc none none intermediate
14 27 PhD student none poor poor
15 24 Master student/else poor intermediate intermediate
16 28 PhD student none poor poor
17 28 Master student/else intermediate intermediate intermediate
18 39 Postdoc good intermediate good
19 32 Master student/else none poor genoWhat?
20 29 PhD student poor none poor
21 31 PhD student none none good
22 30 PhD student none none good
Q6
1 I am a regular user
2 I know it exists
3 I know it exists
4 Is this supposed to be in the cloud?!
5 I know it exists
6 Once in a while i used it
7 I am a regular user
8 I know it exists
9 I know it exists
10 Is this supposed to be in the cloud?!
11 I know it exists
12 I know it exists
13 Is this supposed to be in the cloud?!
14 I know it exists
15 I know it exists
16 I know it exists
17 I know it exists
18 I am a regular user
19 Is this supposed to be in the cloud?!
20 Is this supposed to be in the cloud?!
21 I heard we had some servers around
22 Is this supposed to be in the cloud?!
# you can subset with logical vectors, by row and by column
surveyrbioc[c(rep(TRUE,10),rep(FALSE,8)),]
Q1 Q2 Q3 Q4 Q5
1 28 PhD student intermediate intermediate good
2 27 PhD student poor intermediate poor
3 33 Postdoc good good good
4 32 PhD student poor poor poor
5 29 PhD student none poor poor
6 33 Postdoc intermediate intermediate intermediate
7 40 Postdoc good good intermediate
8 23 Master student/else poor poor good
9 27 PhD student none poor intermediate
10 23 Master student/else poor poor poor
19 32 Master student/else none poor genoWhat?
20 29 PhD student poor none poor
21 31 PhD student none none good
22 30 PhD student none none good
Q6
1 I am a regular user
2 I know it exists
3 I know it exists
4 Is this supposed to be in the cloud?!
5 I know it exists
6 Once in a while i used it
7 I am a regular user
8 I know it exists
9 I know it exists
10 Is this supposed to be in the cloud?!
19 Is this supposed to be in the cloud?!
20 Is this supposed to be in the cloud?!
21 I heard we had some servers around
22 Is this supposed to be in the cloud?!
Q7
1 Learn Parallelization with R (and Bioconductor)
2 <NA>
3 use R scripts in parallel context (ex: alignments in RNA-seq)
4 Id like to gather practise to analyse count data and develope the necessary data design. Additionally, it would be interesting to get to now how to find putative novel miRNAs.
5 Possible I will use R for editing RNA-seq data
6 I would describe myself as an advanced beginner, I am actively using R now to look (mostly) at count tables from NGS data and make plots. I am especially interested in the Bioconductor and parallelization in R section, this is new for me.
7 <NA>
8 Get to know R better, basic commands
9 To look if I can process my sequencing data on my own using R
10 I would like to learn more about R, especially how to use it in biomedial research. Im in the 2nd Semester of the Rasterprogramm biomedicine, last Semester I had two weeks bioinformatics and we used to work with R for statistics/ChIP-Seq/microarray-data analysis, but the time was too short, to go deep into it, we just scratched the surface. So now I wish to learn some more, would be great, if I could work with the programme by myself, for example for the masterthesis.
19 <NA>
20 To better understand R and perform basic analysis
21 working with fastq files based on R
22 I want to be able to analyze my sequence data on my own :P
surveyrbioc[c(TRUE,FALSE),] # keep in mind this behavior!
Q1 Q2 Q3 Q4 Q5
1 28 PhD student intermediate intermediate good
3 33 Postdoc good good good
5 29 PhD student none poor poor
7 40 Postdoc good good intermediate
9 27 PhD student none poor intermediate
11 35 Postdoc poor poor intermediate
13 31 Postdoc none none intermediate
15 24 Master student/else poor intermediate intermediate
17 28 Master student/else intermediate intermediate intermediate
19 32 Master student/else none poor genoWhat?
21 31 PhD student none none good
Q6
1 I am a regular user
3 I know it exists
5 I know it exists
7 I am a regular user
9 I know it exists
11 I know it exists
13 Is this supposed to be in the cloud?!
15 I know it exists
17 I know it exists
19 Is this supposed to be in the cloud?!
21 I heard we had some servers around
Q7
1 Learn Parallelization with R (and Bioconductor)
3 use R scripts in parallel context (ex: alignments in RNA-seq)
5 Possible I will use R for editing RNA-seq data
7 <NA>
9 To look if I can process my sequencing data on my own using R
11 learn more about R
13 to get a good and understandable introduction into R programming and bioinformatics
15 To learn the basics of R programming
17 better understanding of R and to extend my knowledge
19 <NA>
21 working with fastq files based on R
# guess what this does?
surveyrbioc$Q2=="PhD student"
[1] TRUE TRUE FALSE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE
[17] FALSE FALSE FALSE TRUE TRUE TRUE
sum(surveyrbioc$Q2 == "PhD student")
[1] 10
mean(surveyrbioc$Q2 == "PhD student")
[1] 0.4545455
mean(surveyrbioc$Q1[surveyrbioc$Q2 == "PhD student"])
[1] 28.8
sum(surveyrbioc$Q1 >= 30)
[1] 11
sum(surveyrbioc$Q1 < 35 & surveyrbioc$Q2 == "Postdoc")
[1] 3
sum(is.na(surveyrbioc$Q7))
[1] 4
You can
sort and order)merge them)myord <- order(surveyrbioc$Q1)
myord
[1] 8 10 15 2 9 14 1 16 17 5 20 22 13 21 4 19 3 6 12 11 18 7
head(surveyrbioc[myord,1:5],4)
Q1 Q2 Q3 Q4 Q5
8 23 Master student/else poor poor good
10 23 Master student/else poor poor poor
15 24 Master student/else poor intermediate intermediate
2 27 PhD student poor intermediate poor
sorted_surv <- surveyrbioc[myord,1:6]
sort() returns you the sorted data, order() the indices only
# transforming a variable
newsurvey <- surveyrbioc[,1:5]
newsurvey$ageroot <- sqrt(newsurvey$Q1)
head(newsurvey)
Q1 Q2 Q3 Q4 Q5 ageroot
1 28 PhD student intermediate intermediate good 5.291503
2 27 PhD student poor intermediate poor 5.196152
3 33 Postdoc good good good 5.744563
4 32 PhD student poor poor poor 5.656854
5 29 PhD student none poor poor 5.385165
6 33 Postdoc intermediate intermediate intermediate 5.744563
# creating groups out of a continuous variable
newsurvey$agegroup <- cut(newsurvey$Q1,breaks = c(20,30,40))
head(newsurvey)
Q1 Q2 Q3 Q4 Q5 ageroot agegroup
1 28 PhD student intermediate intermediate good 5.291503 (20,30]
2 27 PhD student poor intermediate poor 5.196152 (20,30]
3 33 Postdoc good good good 5.744563 (30,40]
4 32 PhD student poor poor poor 5.656854 (30,40]
5 29 PhD student none poor poor 5.385165 (20,30]
6 33 Postdoc intermediate intermediate intermediate 5.744563 (30,40]
Use case for merge: you have two sets you are playing with! Think in advance what you need for that purpose…
Are PhD students significantly younger than postdocs? Are there any differences in the age of the three groups?
phds <- surveyrbioc[surveyrbioc$Q2=="PhD student",]
postdocs <- surveyrbioc[surveyrbioc$Q2=="Postdoc",]
t.test(phds$Q1,postdocs$Q1)
Welch Two Sample t-test
data: phds$Q1 and postdocs$Q1
t = -4.0528, df = 6.4476, p-value = 0.005767
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-10.146796 -2.586537
sample estimates:
mean of x mean of y
28.80000 35.16667
aov(data=surveyrbioc,Q1~Q2) # What is missing here?
Call:
aov(formula = Q1 ~ Q2, data = surveyrbioc)
Terms:
Q2 Residuals
Sum of Squares 216.8242 207.7667
Deg. of Freedom 2 19
Residual standard error: 3.306824
Estimated effects may be unbalanced
Much more on this: in the next courses!
tapply
You want to calculate the median age of each academic group in here
md <- median(surveyrbioc$Q1)
md_master <- median(surveyrbioc$Q1[surveyrbioc$Q2=="Master student/else"])
md_phd <- median(surveyrbioc$Q1[surveyrbioc$Q2=="PhD student"])
md_postdocs <- median(surveyrbioc$Q1[surveyrbioc$Q2=="Postdoc"])
c(md_master,md_phd,md_postdocs)
[1] 26.0 28.5 34.0
tapply splits the data of the first variable on the levels of the second variable, and applies the function (any function)
tapply(X = surveyrbioc$Q1,INDEX = surveyrbioc$Q2,FUN = median)
Master student/else PhD student Postdoc
26.0 28.5 34.0
lapply and sapply
Back to our iris dataset
names(iris)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
We want the average sepal length and width, and the same for the petals. Uh, and we want the standard deviation too.
# the unefficient way:
seplen_m <- mean(iris$Sepal.Length)
sepwid_m <- mean(iris$Sepal.Width)
petlen_m <- mean(iris$Petal.Length)
petwid_m <- mean(iris$Petal.Width)
seplen_m <- sd(iris$Sepal.Length)
# ... and so on
-> Apply a Function over a List or Vector
# we will use just the first four columns
lapply(iris[,1:4],mean)
$Sepal.Length
[1] 5.843333
$Sepal.Width
[1] 3.057333
$Petal.Length
[1] 3.758
$Petal.Width
[1] 1.199333
sapply(iris[,1:4],mean)
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.843333 3.057333 3.758000 1.199333
lapply(iris[,1:4],sd)
$Sepal.Length
[1] 0.8280661
$Sepal.Width
[1] 0.4358663
$Petal.Length
[1] 1.765298
$Petal.Width
[1] 0.7622377
# ...
The major difference is in the presentation of the output
summary
Try out summary on a data.frame
summary(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50
Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Alternatives in other packages:
describe() in the Hmisc packageskim() from skimrcreate_report() from Data Explorertable
table(surveyrbioc$Q3)
good intermediate none poor
3 3 8 8
table(surveyrbioc$Q4)
good intermediate none poor pro
2 6 4 9 1
table(surveyrbioc$Q2,surveyrbioc$Q3)
good intermediate none poor
Master student/else 0 1 1 4
PhD student 0 1 6 3
Postdoc 3 1 1 1
addmargins()prop.table()ftable()addmargins(table(surveyrbioc$Q2,surveyrbioc$Q3))
good intermediate none poor Sum
Master student/else 0 1 1 4 6
PhD student 0 1 6 3 10
Postdoc 3 1 1 1 6
Sum 3 3 8 8 22
prop.table(table(surveyrbioc$Q2,surveyrbioc$Q3))
good intermediate none poor
Master student/else 0.00000000 0.04545455 0.04545455 0.18181818
PhD student 0.00000000 0.04545455 0.27272727 0.13636364
Postdoc 0.13636364 0.04545455 0.04545455 0.04545455
Please always do check the docs!
The MASS package contains the dataset Cars93, which stores the data on 93 makes of car sold in US
Type specifies the type of market the car is aimed at. Find the cheapest car in each type, and the one with the greatest fuel efficiencydata.frames, one for US cars, the other one with non-US cars.RData)library(MASS)
head(Cars93)
Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway AirBags
1 Acura Integra Small 12.9 15.9 18.8 25 31 None
2 Acura Legend Midsize 29.2 33.9 38.7 18 25 Driver & Passenger
3 Audi 90 Compact 25.9 29.1 32.3 20 26 Driver only
4 Audi 100 Midsize 30.8 37.7 44.6 19 26 Driver & Passenger
5 BMW 535i Midsize 23.7 30.0 36.2 22 30 Driver only
6 Buick Century Midsize 14.2 15.7 17.3 22 31 Driver only
DriveTrain Cylinders EngineSize Horsepower RPM Rev.per.mile Man.trans.avail Fuel.tank.capacity
1 Front 4 1.8 140 6300 2890 Yes 13.2
2 Front 6 3.2 200 5500 2335 Yes 18.0
3 Front 6 2.8 172 5500 2280 Yes 16.9
4 Front 6 2.8 172 5500 2535 Yes 21.1
5 Rear 4 3.5 208 5700 2545 Yes 21.1
6 Front 4 2.2 110 5200 2565 No 16.4
Passengers Length Wheelbase Width Turn.circle Rear.seat.room Luggage.room Weight Origin
1 5 177 102 68 37 26.5 11 2705 non-USA
2 5 195 115 71 38 30.0 15 3560 non-USA
3 5 180 102 67 37 28.0 14 3375 non-USA
4 6 193 106 70 37 31.0 17 3405 non-USA
5 4 186 109 69 39 27.0 13 3640 non-USA
6 6 189 105 69 41 28.0 16 2880 USA
Make
1 Acura Integra
2 Acura Legend
3 Audi 90
4 Audi 100
5 BMW 535i
6 Buick Century
?Cars93
tapply(X = Cars93$Min.Price,INDEX = Cars93$Type,FUN = min)
Compact Large Midsize Small Sporty Van
8.5 17.5 12.4 6.7 9.1 13.6
tapply(X = Cars93$Horsepower,INDEX = Cars93$Type,FUN = mean)
Compact Large Midsize Small Sporty Van
131.0000 179.4545 173.0909 91.0000 160.1429 149.4444
table(Cars93$Origin)
USA non-USA
48 45
us_cars <- Cars93[Cars93$Origin == "USA",]
nonus_cars <- Cars93[Cars93$Origin != "USA",]
# write.csv(us_cars, file = "us_cars.csv")
# save(nonus_cars, file = "nonus_cars.RData")
Many ways for the same task:
plot)ggplot2latticeplotly, ggvis or other librariesWhy bother plotting at all?
Look for some inspiration here, it is an excellent place to start and learn!
plot functionFirst thing: take a look at the overview documentation of plot
?plot
We will see
plot parametersRequired:
Other options
mainxlab and ylabxlim and ylimpch, col and cex - as atomic elements or as vectorsmpglibrary(ggplot2) # this is useful per se, and contains the dataset we will be using
?mpg
This dataset contains a subset of the fuel economy data that the EPA makes available on http://fueleconomy.gov
# works on RStudio
# View(mpg)
# otherwise stick to the classic
str(mpg)
tibble [234 × 12] (S3: tbl_df/tbl/data.frame)
$ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
$ model : chr [1:234] "a4" "a4" "a4" "a4" ...
$ displ : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
$ year : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
$ cyl : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
$ trans : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
$ drv : chr [1:234] "f" "f" "f" "f" ...
$ cty : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
$ hwy : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
$ fl : chr [1:234] "p" "p" "p" "p" ...
$ class : chr [1:234] "compact" "compact" "compact" "compact" ...
$ mygroup : num [1:234] 2 2 2 2 2 2 2 2 2 2 ...
Make a guess: what do you expect to see between fuel consumption and engine size?
plot(mpg$displ,mpg$cty)
Bonus: what is the correlation?
cor(mpg$displ,mpg$cty)
[1] -0.798524
cor(mpg$displ,mpg$cty,method="spearman")
[1] -0.8809049
mpg$mygroup <- as.numeric(factor(mpg$class))
plot(mpg$displ,mpg$cty,
col = mpg$mygroup)
legend("topright",legend = levels(factor(mpg$class)),col=levels(factor(mpg$mygroup)),pch=1)
plot(mpg$displ,mpg$cty,
pch = as.numeric(factor((mpg$class))))
This shows we have quite some overlap of points. What can we do?
Adding some jitter…
plot(x = mpg$displ + rnorm(nrow(mpg),mean = 0,sd = 0.01),
y = mpg$cty + rnorm(rnorm(nrow(mpg),mean = 0,sd = 0.01)),
col = mpg$mygroup,
main = "now with jitter!")
Adding a smoothing line
Trying to see a pattern? Add a smoothing curve.
This one is wrong - missing the reordering of points
plot(mpg$displ,mpg$cty, col = mpg$mygroup)
myloess <- loess(cty~displ, data=mpg)
myfit <- fitted(myloess)
lines(mpg$displ,myfit)
legend("topright",legend = levels(factor(mpg$class)),col=levels(factor(mpg$mygroup)),pch=1)
This one is correct!
plot(mpg$displ,mpg$cty, col = mpg$mygroup)
myloess <- loess(cty~displ, data=mpg)
myfit <- fitted(myloess)
myord <- order(mpg$displ)
lines(mpg$displ[myord],myfit[myord])
legend("topright",legend = levels(factor(mpg$class)),col=levels(factor(mpg$mygroup)),pch=1)
lines can add (almost) anything (any line).
points works in a similar way to superimpose, well, points
?barplot
academia_levels <- table(surveyrbioc$Q2)
barplot(academia_levels)
How is the age distributed across academic levels? Check the help of boxplot
y~x variables - ok, it can get more complicatedboxplot(Q1~Q2,
data = surveyrbioc)
Splitting on more factors
boxplot(Q1~Q2+Q3,
data = surveyrbioc)
Making it more readable…
boxplot(Q1~Q2+Q3,
data = surveyrbioc,
las = 2)
Changing the parameters allows you to control many aspects on plot appearance
par is your best friend - and enemy (see ?par)
par(mar=c(15,3,2,2))
boxplot(Q1~Q2+Q3,data = surveyrbioc,las = 2)
par( ... ) has many arguments; here, the useful/most used ones
mar for handling the marginscex, col, pch and co. are all parameters of parlas to change the style of the axis labelsmfrow to draw an array of figureshist(surveyrbioc$Q1,breaks = 8)
hist(mpg$cty,breaks = 10)
hist(mpg$cty,breaks = 10, col = "steelblue")
hist(mpg$cty,breaks = 10, col = "steelblue", border = "gray")
hist(mpg$cty,breaks = 10, col = "steelblue", border = "gray",main = "Distribution of miles/gallon consumption in city traffic")
DON’T.
If you really need to do it…
?pie
example(pie) # expecially the last one
pie> require(grDevices)
pie> pie(rep(1, 24), col = rainbow(24), radius = 0.9)
pie> pie.sales <- c(0.12, 0.3, 0.26, 0.16, 0.04, 0.12)
pie> names(pie.sales) <- c("Blueberry", "Cherry",
pie+ "Apple", "Boston Cream", "Other", "Vanilla Cream")
pie> pie(pie.sales) # default colours
pie> pie(pie.sales, col = c("purple", "violetred1", "green3",
pie+ "cornsilk", "cyan", "white"))
pie> pie(pie.sales, col = gray(seq(0.4, 1.0, length.out = 6)))
pie> pie(pie.sales, density = 10, angle = 15 + 10 * 1:6)
pie> pie(pie.sales, clockwise = TRUE, main = "pie(*, clockwise = TRUE)")
pie> segments(0, 0, 0, 1, col = "red", lwd = 2)
pie> text(0, 1, "init.angle = 90", col = "red")
pie> n <- 200
pie> pie(rep(1, n), labels = "", col = rainbow(n), border = NA,
pie+ main = "pie(*, labels=\"\", col=rainbow(n), border=NA,..")
pie> ## Another case showing pie() is rather fun than science:
pie> ## (original by FinalBackwardsGlance on http://imgur.com/gallery/wWrpU4X)
pie> pie(c(Sky = 78, "Sunny side of pyramid" = 17, "Shady side of pyramid" = 5),
pie+ init.angle = 315, col = c("deepskyblue", "yellow", "yellow3"), border = FALSE)
pie(c(20, 80), init.angle=-40,
col=c("white", "yellow"),
label=c("no pacman", "pacman"),
border = "lightgrey")
… or switch from pie to waffle (seriously)
DON’T. And this time I mean it
sadly enough there would be packages for this, too
a.k.a. Why is this bad?
age_by_group <- tapply(surveyrbioc$Q1,surveyrbioc$Q2,mean)
sd_by_group <- tapply(surveyrbioc$Q1,surveyrbioc$Q2,sd)
mybar <- barplot(age_by_group,col=c("khaki","salmon","firebrick"), ylim=c(0,max(age_by_group) + 5))
# mybar, inspect it
arrows(mybar, age_by_group,mybar, (age_by_group + sd_by_group), length = 0.15,angle= 90)
Dynamite plots VS boxplots
boxplot(Q1~Q2,
data = surveyrbioc)
Median VS distribution VS actual points… What do you really want to show?
type in ?plotlogtextexpressionGeneral code structure for this
opendevice()
...
code for the plot
...
closedevice()
pdf("myfilename.pdf")
# see also alternatives:
## png()
## jpeg()
plot(mpg$displ,mpg$cty,
col = mpg$mygroup)
dev.off()
Back to the iris. Three species are there. Explore the dataset in the following ways:
draw a histogram of the petal length. What do you see?
plot sepal length versus petal length. Add different colors to highlight the species
do the same for sepal width and sepal length, and this time use a different symbol for the species. Add a legend and a title if you want
(harder) calculate the mean values of each feature for each species, organizing it in a matrix where the rows are the species names. Generate a stacked bar plot with it, and another one where the bars are arranged horizontally
feel free to go back to the survey data and explore it further!
hist(iris$Petal.Length)
plot(iris$Sepal.Length,iris$Petal.Length)
plot(iris$Sepal.Length,iris$Petal.Length,col=iris$Species)
plot(iris$Sepal.Width,iris$Sepal.Length, pch = as.numeric(factor(iris$Species)))
legend("topright",legend = levels(factor(iris$Species)),pch=unique(factor(iris$Species)))
sl_means <- tapply(iris$Sepal.Length,iris$Species,mean)
pl_means <- tapply(iris$Petal.Length,iris$Species,mean)
sw_means <- tapply(iris$Sepal.Width,iris$Species,mean)
pw_means <- tapply(iris$Petal.Width,iris$Species,mean)
mymat <- cbind(sl_means,pl_means,sw_means,pw_means)
barplot(mymat,legend.text = unique(iris$Species))
barplot(mymat,beside = TRUE,legend.text = unique(iris$Species))
pairs(iris[,1:4],col=iris$Species)
You can use the panels even more cleverly, check the help of pairs!
This is a collection on graphs in R - with the underlying code too.
gapminder projectggplot2But first, meet the gapminder data
library(gapminder)
head(gapminder)
# A tibble: 6 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
head(country_colors)
Nigeria Egypt Ethiopia Congo, Dem. Rep. South Africa
"#7F3B08" "#833D07" "#873F07" "#8B4107" "#8F4407"
Sudan
"#934607"
head(continent_colors)
Africa Americas Asia Europe Oceania
"#7F3B08" "#A50026" "#40004B" "#276419" "#313695"
Variables:
ggplot2 philosophygg stands for the grammar of graphics
dataaesthetics (shape, size, colour)geoms to specify how you want to have the data plottedstatistical transformationsfacets allow you to do quick elegant multi plotsIt can come across somewhat harder since
yet, it makes the whole process of “thinking data” more natural.
ggplot(gapminder,aes(x = gdpPercap, y = lifeExp))
ggplot(gapminder,aes(x = gdpPercap, y = lifeExp)) + geom_point()
We can store ggplot plot objects into a variable - and build upon that later
p <- ggplot(gapminder,aes(x = gdpPercap, y = lifeExp))
p + geom_point()
p + geom_point() + scale_x_log10()
p <- p + scale_x_log10()
p + geom_point(color="blue")
p + geom_point(color="steelblue", pch=19, size=8, alpha=1/4)
p + geom_point(aes(color=continent))
p + geom_point(aes(col=continent), size=4)
p + geom_point(aes(col=continent, size=pop))
p + geom_point(aes(col=continent, size=pop)) + geom_smooth()
niceone <- p + geom_point(aes(col=continent, size=pop)) +
geom_smooth(aes(col=continent),se=FALSE)
niceone
p + geom_point(aes(col=continent, size=pop)) +
geom_smooth(lwd=2, se=FALSE, method="lm", col="red")
p + geom_point(aes(col=continent, size=pop)) +
geom_smooth(aes(col=continent),lwd=2, se=FALSE, method="lm")
p + geom_point() + facet_wrap(~continent)
p + geom_point(aes(col=continent)) + facet_wrap(~continent)
p + geom_point(aes(col=continent)) + geom_smooth() + facet_wrap(~continent)
ggsave(file="myplot.png")
ggplot(gapminder,
aes(x = year, y = lifeExp, group = country, color = country)
) +
geom_line(lwd = 1, show.legend = FALSE) +
scale_color_manual(values = country_colors) +
theme_bw() + theme(strip.text = element_text(size = rel(1.1)))
bp <- ggplot(gapminder,
aes(x = year, y = lifeExp, group = country, color = country)
) +
geom_line(lwd = 1, show.legend = FALSE) + facet_wrap(~ continent) +
scale_color_manual(values = country_colors) + theme(strip.text = element_text(size = rel(1.1)))
bp
plotly::ggplotly(bp)
# now it is a categorical x VS continuous y
p <- ggplot(gapminder, aes(x = continent, y = lifeExp))
p + geom_point()
p + geom_point(alpha=1/4)
It is so easy to escape dynamite plots!
p + geom_jitter()
p + geom_jitter(aes(col=continent))
p + geom_boxplot()
p + geom_boxplot() + geom_jitter(alpha=1/2)
p <- ggplot(gapminder, aes(lifeExp))
p + geom_histogram()
p + geom_histogram(binwidth=1)
Stacked histogram are much easier in this framework
p + geom_histogram(aes(color=continent))
p + geom_histogram(aes(fill=continent))
p + geom_histogram(aes(fill=continent), position="identity")
… and so is the superimposing of more than one distribution
p + geom_histogram(aes(fill=continent), position="identity", alpha = 0.4)
Similar to histogram, you can use also density plots
p + geom_density(aes(fill=continent), alpha=1/4)
niceone + theme_bw()
niceone + theme_void()
If you really really really have to…
library("ggthemes")
niceone + theme_excel() + scale_color_excel()
try to recreate the plots you did with base graphics, this time using ggplot2
pick a nice plot you would like to have in your next manuscript: can you think of what you need to do it? I am talking of
Data in bioinformatics is often complex. To deal with this, developers define specialised data containers (termed classes) that match the properties of the data they need to handle.
This aspect is central to the Bioconductor(https://www.bioconductor.org) project which uses the same core data infrastructure across packages. This certainly contributed to Bioconductor’s success. Bioconductor package developers are advised to make use of existing infrastructure to provide coherence, interoperability and stability to the project as a whole.
To illustrate such an omics data container, we’ll present the SummarizedExperiment class.
The figure below represents the anatomy of SummarizedExperiment.
Objects of the class SummarizedExperiment contain :
One (or more) assay(s) containing the quantitative omics data (expression data), stored as a matrix-like object. Features (genes, transcripts, proteins, …) are defined along the rows and samples along the columns.
A sample metadata slot containing sample co-variates, stored as a data frame. Rows from this table represent samples (rows match exactly the columns of the expression data).
A feature metadata slot containing feature co-variates, stored as data frame. The rows of this dataframe’s match exactly the rows of the expression data.
The coordinated nature of the SummarizedExperiment guarantees that during data manipulation, the dimensions of the different slots will always match (i.e the columns in the expression data and then rows in the sample metadata, as well as the rows in the expression data and feature metadata) during data manipulation. For example, if we had to exclude one sample from the assay, it would be automatically removed from the sample metadata in the same operation.
The metadata slots can grow additional co-variates (columns) without affecting the other structures.
Questions
Q1 - Can you think of data examples what can fit into this container?
Q2 - What if the data has some “specific” peculiarities on top of this tabular-like representation?
rna <- read_csv("data/rnaseq.csv")
head(rna)
# A tibble: 6 × 19
gene sample expression organism age sex infection strain time tissue mouse ENTREZID product
<chr> <chr> <dbl> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl> <chr>
1 Asl GSM25… 1170 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb… 14 109900 argini…
2 Apod GSM25… 36194 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb… 14 11815 apolip…
3 Cyp2d… GSM25… 4060 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb… 14 56448 cytoch…
4 Klk6 GSM25… 287 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb… 14 19144 kallik…
5 Fcrls GSM25… 85 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb… 14 80891 Fc rec…
6 Slc2a4 GSM25… 782 Mus mus… 8 Fema… Influenz… C57BL… 8 Cereb… 14 20528 solute…
# ℹ 6 more variables: ensembl_gene_id <chr>, external_synonym <chr>, chromosome_name <chr>,
# gene_biotype <chr>, phenotype_description <chr>, hsapiens_homolog_associated_gene_name <chr>
head(as.data.frame(rna))
gene sample expression organism age sex infection strain time tissue mouse
1 Asl GSM2545336 1170 Mus musculus 8 Female InfluenzaA C57BL/6 8 Cerebellum 14
2 Apod GSM2545336 36194 Mus musculus 8 Female InfluenzaA C57BL/6 8 Cerebellum 14
3 Cyp2d22 GSM2545336 4060 Mus musculus 8 Female InfluenzaA C57BL/6 8 Cerebellum 14
4 Klk6 GSM2545336 287 Mus musculus 8 Female InfluenzaA C57BL/6 8 Cerebellum 14
5 Fcrls GSM2545336 85 Mus musculus 8 Female InfluenzaA C57BL/6 8 Cerebellum 14
6 Slc2a4 GSM2545336 782 Mus musculus 8 Female InfluenzaA C57BL/6 8 Cerebellum 14
ENTREZID product
1 109900 argininosuccinate lyase, transcript variant X1
2 11815 apolipoprotein D, transcript variant 3
3 56448 cytochrome P450, family 2, subfamily d, polypeptide 22, transcript variant 2
4 19144 kallikrein related-peptidase 6, transcript variant 2
5 80891 Fc receptor-like S, scavenger receptor, transcript variant X1
6 20528 solute carrier family 2 (facilitated glucose transporter), member 4
ensembl_gene_id external_synonym chromosome_name gene_biotype
1 ENSMUSG00000025533 2510006M18Rik 5 protein_coding
2 ENSMUSG00000022548 <NA> 16 protein_coding
3 ENSMUSG00000061740 2D22 15 protein_coding
4 ENSMUSG00000050063 Bssp 7 protein_coding
5 ENSMUSG00000015852 2810439C17Rik 3 protein_coding
6 ENSMUSG00000018566 Glut-4 11 protein_coding
phenotype_description hsapiens_homolog_associated_gene_name
1 abnormal circulating amino acid level ASL
2 abnormal lipid homeostasis APOD
3 abnormal skin morphology CYP2D6
4 abnormal cytokine level KLK6
5 decreased CD8-positive alpha-beta T cell number FCRL2
6 abnormal circulating glucose level SLC2A4
Remember the rna dataset that we have used previously.
From this table we have already created 3 different tables - we read them in as serialized r objects.
count_matrix <- readRDS("data/count_matrix.RDS")
sample_metadata <- readRDS("data/sample_metadata.RDS")
gene_metadata <- readRDS("data/gene_metadata.RDS")
count_matrix[1:5, ]
GSM2545336 GSM2545337 GSM2545338 GSM2545339 GSM2545340 GSM2545341 GSM2545342 GSM2545343
Asl 1170 361 400 586 626 988 836 535
Apod 36194 10347 9173 10620 13021 29594 24959 13668
Cyp2d22 4060 1616 1603 1901 2171 3349 3122 2008
Klk6 287 629 641 578 448 195 186 1101
Fcrls 85 233 244 237 180 38 68 375
GSM2545344 GSM2545345 GSM2545346 GSM2545347 GSM2545348 GSM2545349 GSM2545350 GSM2545351
Asl 586 597 938 1035 494 481 666 937
Apod 13230 15868 27769 34301 11258 11812 15816 29242
Cyp2d22 2254 2277 2985 3452 1883 2014 2417 3678
Klk6 537 567 327 233 742 881 828 250
Fcrls 199 177 89 67 300 233 231 81
GSM2545352 GSM2545353 GSM2545354 GSM2545362 GSM2545363 GSM2545380
Asl 803 541 473 748 576 1192
Apod 20415 13682 11088 15916 11166 38148
Cyp2d22 2920 2216 1821 2842 2011 4019
Klk6 798 710 894 501 598 259
Fcrls 303 285 248 179 184 68
dim(count_matrix)
[1] 1474 22
sample_metadata
# A tibble: 22 × 9
sample organism age sex infection strain time tissue mouse
<chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl>
1 GSM2545336 Mus musculus 8 Female InfluenzaA C57BL/6 8 Cerebellum 14
2 GSM2545337 Mus musculus 8 Female NonInfected C57BL/6 0 Cerebellum 9
3 GSM2545338 Mus musculus 8 Female NonInfected C57BL/6 0 Cerebellum 10
4 GSM2545339 Mus musculus 8 Female InfluenzaA C57BL/6 4 Cerebellum 15
5 GSM2545340 Mus musculus 8 Male InfluenzaA C57BL/6 4 Cerebellum 18
6 GSM2545341 Mus musculus 8 Male InfluenzaA C57BL/6 8 Cerebellum 6
7 GSM2545342 Mus musculus 8 Female InfluenzaA C57BL/6 8 Cerebellum 5
8 GSM2545343 Mus musculus 8 Male NonInfected C57BL/6 0 Cerebellum 11
9 GSM2545344 Mus musculus 8 Female InfluenzaA C57BL/6 4 Cerebellum 22
10 GSM2545345 Mus musculus 8 Male InfluenzaA C57BL/6 4 Cerebellum 13
# ℹ 12 more rows
gene_metadata
# A tibble: 1,474 × 9
gene ENTREZID product ensembl_gene_id external_synonym chromosome_name gene_biotype
<chr> <dbl> <chr> <chr> <chr> <chr> <chr>
1 Asl 109900 argininosuccinate… ENSMUSG0000002… 2510006M18Rik 5 protein_cod…
2 Apod 11815 apolipoprotein D,… ENSMUSG0000002… <NA> 16 protein_cod…
3 Cyp2d22 56448 cytochrome P450, … ENSMUSG0000006… 2D22 15 protein_cod…
4 Klk6 19144 kallikrein relate… ENSMUSG0000005… Bssp 7 protein_cod…
5 Fcrls 80891 Fc receptor-like … ENSMUSG0000001… 2810439C17Rik 3 protein_cod…
6 Slc2a4 20528 solute carrier fa… ENSMUSG0000001… Glut-4 11 protein_cod…
7 Exd2 97827 exonuclease 3'-5'… ENSMUSG0000003… 4930539P14Rik 12 protein_cod…
8 Gjc2 118454 gap junction prot… ENSMUSG0000004… B230382L12Rik 11 protein_cod…
9 Plp1 18823 proteolipid prote… ENSMUSG0000003… DM20 X protein_cod…
10 Gnb4 14696 guanine nucleotid… ENSMUSG0000002… 6720453A21Rik 3 protein_cod…
# ℹ 1,464 more rows
# ℹ 2 more variables: phenotype_description <chr>, hsapiens_homolog_associated_gene_name <chr>
We will create a SummarizedExperiment from these tables:
The count matrix that will be used as the assay
The table describing the samples will be used as the sample metadata slot
The table describing the genes will be used as the features metadata slot
To do this we can put the different parts together using the
SummarizedExperiment constructor:
#BiocManager::install("SummarizedExperiment")
library("SummarizedExperiment")
se <- SummarizedExperiment(assays = count_matrix,
colData = sample_metadata,
rowData = gene_metadata)
se
class: SummarizedExperiment
dim: 1474 22
metadata(0):
assays(1): ''
rownames(1474): Asl Apod ... Lmx1a Pbx1
rowData names(9): gene ENTREZID ... phenotype_description
hsapiens_homolog_associated_gene_name
colnames(22): GSM2545336 GSM2545337 ... GSM2545363 GSM2545380
colData names(9): sample organism ... tissue mouse
Using this data structure, we can access the expression matrix with
the assay function:
head(assay(se))
GSM2545336 GSM2545337 GSM2545338 GSM2545339 GSM2545340 GSM2545341 GSM2545342 GSM2545343
Asl 1170 361 400 586 626 988 836 535
Apod 36194 10347 9173 10620 13021 29594 24959 13668
Cyp2d22 4060 1616 1603 1901 2171 3349 3122 2008
Klk6 287 629 641 578 448 195 186 1101
Fcrls 85 233 244 237 180 38 68 375
Slc2a4 782 231 248 265 313 786 528 249
GSM2545344 GSM2545345 GSM2545346 GSM2545347 GSM2545348 GSM2545349 GSM2545350 GSM2545351
Asl 586 597 938 1035 494 481 666 937
Apod 13230 15868 27769 34301 11258 11812 15816 29242
Cyp2d22 2254 2277 2985 3452 1883 2014 2417 3678
Klk6 537 567 327 233 742 881 828 250
Fcrls 199 177 89 67 300 233 231 81
Slc2a4 266 357 654 693 271 304 349 715
GSM2545352 GSM2545353 GSM2545354 GSM2545362 GSM2545363 GSM2545380
Asl 803 541 473 748 576 1192
Apod 20415 13682 11088 15916 11166 38148
Cyp2d22 2920 2216 1821 2842 2011 4019
Klk6 798 710 894 501 598 259
Fcrls 303 285 248 179 184 68
Slc2a4 513 320 248 350 317 796
dim(assay(se))
[1] 1474 22
We can access the sample metadata using the colData function:
colData(se)
DataFrame with 22 rows and 9 columns
sample organism age sex infection strain time
<character> <character> <numeric> <character> <character> <character> <numeric>
GSM2545336 GSM2545336 Mus musculus 8 Female InfluenzaA C57BL/6 8
GSM2545337 GSM2545337 Mus musculus 8 Female NonInfected C57BL/6 0
GSM2545338 GSM2545338 Mus musculus 8 Female NonInfected C57BL/6 0
GSM2545339 GSM2545339 Mus musculus 8 Female InfluenzaA C57BL/6 4
GSM2545340 GSM2545340 Mus musculus 8 Male InfluenzaA C57BL/6 4
... ... ... ... ... ... ... ...
GSM2545353 GSM2545353 Mus musculus 8 Female NonInfected C57BL/6 0
GSM2545354 GSM2545354 Mus musculus 8 Male NonInfected C57BL/6 0
GSM2545362 GSM2545362 Mus musculus 8 Female InfluenzaA C57BL/6 4
GSM2545363 GSM2545363 Mus musculus 8 Male InfluenzaA C57BL/6 4
GSM2545380 GSM2545380 Mus musculus 8 Female InfluenzaA C57BL/6 8
tissue mouse
<character> <numeric>
GSM2545336 Cerebellum 14
GSM2545337 Cerebellum 9
GSM2545338 Cerebellum 10
GSM2545339 Cerebellum 15
GSM2545340 Cerebellum 18
... ... ...
GSM2545353 Cerebellum 4
GSM2545354 Cerebellum 2
GSM2545362 Cerebellum 20
GSM2545363 Cerebellum 12
GSM2545380 Cerebellum 19
dim(colData(se))
[1] 22 9
We can also access the feature metadata using the rowData function:
head(rowData(se))
DataFrame with 6 rows and 9 columns
gene ENTREZID product ensembl_gene_id external_synonym
<character> <numeric> <character> <character> <character>
Asl Asl 109900 argininosuccinate ly.. ENSMUSG00000025533 2510006M18Rik
Apod Apod 11815 apolipoprotein D, tr.. ENSMUSG00000022548 NA
Cyp2d22 Cyp2d22 56448 cytochrome P450, fam.. ENSMUSG00000061740 2D22
Klk6 Klk6 19144 kallikrein related-p.. ENSMUSG00000050063 Bssp
Fcrls Fcrls 80891 Fc receptor-like S, .. ENSMUSG00000015852 2810439C17Rik
Slc2a4 Slc2a4 20528 solute carrier famil.. ENSMUSG00000018566 Glut-4
chromosome_name gene_biotype phenotype_description hsapiens_homolog_associated_gene_name
<character> <character> <character> <character>
Asl 5 protein_coding abnormal circulating.. ASL
Apod 16 protein_coding abnormal lipid homeo.. APOD
Cyp2d22 15 protein_coding abnormal skin morpho.. CYP2D6
Klk6 7 protein_coding abnormal cytokine le.. KLK6
Fcrls 3 protein_coding decreased CD8-positi.. FCRL2
Slc2a4 11 protein_coding abnormal circulating.. SLC2A4
dim(rowData(se))
[1] 1474 9
SummarizedExperiment can be subset just like with data frames, with numerics or with characters of logicals.
Below, we create a new instance of class SummarizedExperiment that contains only the 5 first features for the 3 first samples.
se1 <- se[1:5, 1:3]
se1
class: SummarizedExperiment
dim: 5 3
metadata(0):
assays(1): ''
rownames(5): Asl Apod Cyp2d22 Klk6 Fcrls
rowData names(9): gene ENTREZID ... phenotype_description
hsapiens_homolog_associated_gene_name
colnames(3): GSM2545336 GSM2545337 GSM2545338
colData names(9): sample organism ... tissue mouse
colData(se1)
DataFrame with 3 rows and 9 columns
sample organism age sex infection strain time
<character> <character> <numeric> <character> <character> <character> <numeric>
GSM2545336 GSM2545336 Mus musculus 8 Female InfluenzaA C57BL/6 8
GSM2545337 GSM2545337 Mus musculus 8 Female NonInfected C57BL/6 0
GSM2545338 GSM2545338 Mus musculus 8 Female NonInfected C57BL/6 0
tissue mouse
<character> <numeric>
GSM2545336 Cerebellum 14
GSM2545337 Cerebellum 9
GSM2545338 Cerebellum 10
rowData(se1)
DataFrame with 5 rows and 9 columns
gene ENTREZID product ensembl_gene_id external_synonym
<character> <numeric> <character> <character> <character>
Asl Asl 109900 argininosuccinate ly.. ENSMUSG00000025533 2510006M18Rik
Apod Apod 11815 apolipoprotein D, tr.. ENSMUSG00000022548 NA
Cyp2d22 Cyp2d22 56448 cytochrome P450, fam.. ENSMUSG00000061740 2D22
Klk6 Klk6 19144 kallikrein related-p.. ENSMUSG00000050063 Bssp
Fcrls Fcrls 80891 Fc receptor-like S, .. ENSMUSG00000015852 2810439C17Rik
chromosome_name gene_biotype phenotype_description hsapiens_homolog_associated_gene_name
<character> <character> <character> <character>
Asl 5 protein_coding abnormal circulating.. ASL
Apod 16 protein_coding abnormal lipid homeo.. APOD
Cyp2d22 15 protein_coding abnormal skin morpho.. CYP2D6
Klk6 7 protein_coding abnormal cytokine le.. KLK6
Fcrls 3 protein_coding decreased CD8-positi.. FCRL2
We can also use the colData() function to subset on something from the sample metadata, or the rowData() to subset on something from the feature metadata.
For example, here we keep only miRNAs and the non infected samples:
se1 <- se[rowData(se)$gene_biotype == "miRNA",
colData(se)$infection == "NonInfected"]
se1
class: SummarizedExperiment
dim: 7 7
metadata(0):
assays(1): ''
rownames(7): Mir1901 Mir378a ... Mir128-1 Mir7682
rowData names(9): gene ENTREZID ... phenotype_description
hsapiens_homolog_associated_gene_name
colnames(7): GSM2545337 GSM2545338 ... GSM2545353 GSM2545354
colData names(9): sample organism ... tissue mouse
assay(se1)
GSM2545337 GSM2545338 GSM2545343 GSM2545348 GSM2545349 GSM2545353 GSM2545354
Mir1901 45 44 74 55 68 33 60
Mir378a 11 7 9 4 12 4 8
Mir133b 4 6 5 4 6 7 3
Mir30c-2 10 6 16 12 8 17 15
Mir149 1 2 0 0 0 0 2
Mir128-1 4 1 2 2 1 2 1
Mir7682 2 0 4 1 3 5 5
colData(se1)
DataFrame with 7 rows and 9 columns
sample organism age sex infection strain time
<character> <character> <numeric> <character> <character> <character> <numeric>
GSM2545337 GSM2545337 Mus musculus 8 Female NonInfected C57BL/6 0
GSM2545338 GSM2545338 Mus musculus 8 Female NonInfected C57BL/6 0
GSM2545343 GSM2545343 Mus musculus 8 Male NonInfected C57BL/6 0
GSM2545348 GSM2545348 Mus musculus 8 Female NonInfected C57BL/6 0
GSM2545349 GSM2545349 Mus musculus 8 Male NonInfected C57BL/6 0
GSM2545353 GSM2545353 Mus musculus 8 Female NonInfected C57BL/6 0
GSM2545354 GSM2545354 Mus musculus 8 Male NonInfected C57BL/6 0
tissue mouse
<character> <numeric>
GSM2545337 Cerebellum 9
GSM2545338 Cerebellum 10
GSM2545343 Cerebellum 11
GSM2545348 Cerebellum 8
GSM2545349 Cerebellum 7
GSM2545353 Cerebellum 4
GSM2545354 Cerebellum 2
rowData(se1)
DataFrame with 7 rows and 9 columns
gene ENTREZID product ensembl_gene_id external_synonym chromosome_name
<character> <numeric> <character> <character> <character> <character>
Mir1901 Mir1901 100316686 microRNA 1901 ENSMUSG00000084565 Mirn1901 18
Mir378a Mir378a 723889 microRNA 378a ENSMUSG00000105200 Mirn378 18
Mir133b Mir133b 723817 microRNA 133b ENSMUSG00000065480 mir 133b 1
Mir30c-2 Mir30c-2 723964 microRNA 30c-2 ENSMUSG00000065567 mir 30c-2 1
Mir149 Mir149 387167 microRNA 149 ENSMUSG00000065470 Mirn149 1
Mir128-1 Mir128-1 387147 microRNA 128-1 ENSMUSG00000065520 Mirn128 1
Mir7682 Mir7682 102466847 microRNA 7682 ENSMUSG00000106406 mmu-mir-7682 1
gene_biotype phenotype_description hsapiens_homolog_associated_gene_name
<character> <character> <character>
Mir1901 miRNA NA NA
Mir378a miRNA abnormal mitochondri.. MIR378A
Mir133b miRNA no abnormal phenotyp.. MIR133B
Mir30c-2 miRNA NA MIR30C2
Mir149 miRNA increased circulatin.. NA
Mir128-1 miRNA no abnormal phenotyp.. MIR128-1
Mir7682 miRNA NA NA
For the following exercise, you should download the SE.rda object
(that contains the se object), and open the file using the
‘load()’ function.
download.file(url = "https://raw.githubusercontent.com/UCLouvain-CBIO/bioinfo-training-01-intro-r/master/data/SE.rda",
destfile = "data/SE.rda")
load(file = "data/SE.rda")
Extract the gene expression levels of the 3 first genes in samples at time 0 and at time 8.
assay(se)[1:3, colData(se)$time != 4]
GSM2545336 GSM2545337 GSM2545338 GSM2545341 GSM2545342 GSM2545343 GSM2545346 GSM2545347
Asl 1170 361 400 988 836 535 938 1035
Apod 36194 10347 9173 29594 24959 13668 27769 34301
Cyp2d22 4060 1616 1603 3349 3122 2008 2985 3452
GSM2545348 GSM2545349 GSM2545351 GSM2545353 GSM2545354 GSM2545380
Asl 494 481 937 541 473 1192
Apod 11258 11812 29242 13682 11088 38148
Cyp2d22 1883 2014 3678 2216 1821 4019
# Equivalent to
assay(se)[1:3, colData(se)$time == 0 | colData(se)$time == 8]
GSM2545336 GSM2545337 GSM2545338 GSM2545341 GSM2545342 GSM2545343 GSM2545346 GSM2545347
Asl 1170 361 400 988 836 535 938 1035
Apod 36194 10347 9173 29594 24959 13668 27769 34301
Cyp2d22 4060 1616 1603 3349 3122 2008 2985 3452
GSM2545348 GSM2545349 GSM2545351 GSM2545353 GSM2545354 GSM2545380
Asl 494 481 937 541 473 1192
Apod 11258 11812 29242 13682 11088 38148
Cyp2d22 1883 2014 3678 2216 1821 4019
We can also add information to the metadata. Suppose that you want to add the center where the samples were collected…
colData(se)$center <- rep("University of Illinois", nrow(colData(se)))
colData(se)
DataFrame with 22 rows and 10 columns
sample organism age sex infection strain time
<character> <character> <numeric> <character> <character> <character> <numeric>
GSM2545336 GSM2545336 Mus musculus 8 Female InfluenzaA C57BL/6 8
GSM2545337 GSM2545337 Mus musculus 8 Female NonInfected C57BL/6 0
GSM2545338 GSM2545338 Mus musculus 8 Female NonInfected C57BL/6 0
GSM2545339 GSM2545339 Mus musculus 8 Female InfluenzaA C57BL/6 4
GSM2545340 GSM2545340 Mus musculus 8 Male InfluenzaA C57BL/6 4
... ... ... ... ... ... ... ...
GSM2545353 GSM2545353 Mus musculus 8 Female NonInfected C57BL/6 0
GSM2545354 GSM2545354 Mus musculus 8 Male NonInfected C57BL/6 0
GSM2545362 GSM2545362 Mus musculus 8 Female InfluenzaA C57BL/6 4
GSM2545363 GSM2545363 Mus musculus 8 Male InfluenzaA C57BL/6 4
GSM2545380 GSM2545380 Mus musculus 8 Female InfluenzaA C57BL/6 8
tissue mouse center
<character> <numeric> <character>
GSM2545336 Cerebellum 14 University of Illinois
GSM2545337 Cerebellum 9 University of Illinois
GSM2545338 Cerebellum 10 University of Illinois
GSM2545339 Cerebellum 15 University of Illinois
GSM2545340 Cerebellum 18 University of Illinois
... ... ... ...
GSM2545353 Cerebellum 4 University of Illinois
GSM2545354 Cerebellum 2 University of Illinois
GSM2545362 Cerebellum 20 University of Illinois
GSM2545363 Cerebellum 12 University of Illinois
GSM2545380 Cerebellum 19 University of Illinois
This illustrates that the metadata slots can grow indefinitely without affecting the other structures!
Take-home message
SummarizedExperiment represent an efficient way to store and to handle omics data.
They are used in many Bioconductor packages.
If you follow next training focused on RNA sequencing analysis, you will learn to
use the Bioconductor DESeq2 package to do some differential expression analyses.
DESeq2’s whole analysis is handled in a SummarizedExperiment.
Books
R in a nutshell, R cookbook, R graphics cookbook (@O’Reilly media)
A Beginner’s Guide to R (Zuur, Ieno, Meesters, @Springer)
R Programming for Data Science (Peng, @Leanpub)
R Programming for Bioinformatics (Gentleman, @CRC)
Bioconductor Case Studies (@Springer)
Data Analysis for the Life Sciences (Irizarry, Love, @Leanpub)
Bioconductor - An Introduction to Core Technologies (Hansen, @Leanpub)
R for Data Science (Wickham, Grolemund, @O’Reilly)
the whole Use R! book series: https://link.springer.com/bookseries/6991
this one from CRC: https://www.crcpress.com/Chapman--HallCRC-The-R-Series/book-series/CRCTHERSER
Courses
Misc
sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.6.4
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/Berlin
tzcode source: internal
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] clusterProfiler_4.8.1 topGO_2.52.0 SparseM_1.81
[4] GO.db_3.17.0 graph_1.78.0 GeneTonic_2.4.0
[7] iSEEu_1.12.0 iSEEhex_1.2.0 iSEE_2.12.0
[10] SingleCellExperiment_1.22.0 pheatmap_1.0.12 apeglm_1.22.1
[13] pcaExplorer_2.26.1 bigmemory_4.6.1 edgeR_3.42.4
[16] limma_3.56.2 ExploreModelMatrix_1.12.0 GenomicFeatures_1.52.1
[19] org.Hs.eg.db_3.17.0 AnnotationDbi_1.62.1 DESeq2_1.40.2
[22] tximeta_1.18.0 macrophage_1.16.0 knitr_1.43
[25] BiocStyle_2.28.0 SummarizedExperiment_1.30.2 Biobase_2.60.0
[28] GenomicRanges_1.52.0 GenomeInfoDb_1.36.1 IRanges_2.34.1
[31] S4Vectors_0.38.1 BiocGenerics_0.46.0 MatrixGenerics_1.12.2
[34] matrixStats_1.0.0 lubridate_1.9.2 forcats_1.0.0
[37] stringr_1.5.0 dplyr_1.1.2 purrr_1.0.1
[40] readr_2.1.4 tidyr_1.3.0 tibble_3.2.1
[43] tidyverse_2.0.0 ggthemes_4.2.4 gapminder_1.0.0
[46] ggplot2_3.4.2 MASS_7.3-60 rmarkdown_2.22
loaded via a namespace (and not attached):
[1] GSEABase_1.62.0 vroom_1.6.3 progress_1.2.2
[4] DT_0.28 Biostrings_2.68.1 pagedown_0.20
[7] vctrs_0.6.3 digest_0.6.32 png_0.1-8
[10] shape_1.4.6 shinyBS_0.61.1 registry_0.5-1
[13] ggrepel_0.9.3 reshape2_1.4.4 qvalue_2.32.0
[16] httpuv_1.6.11 foreach_1.5.2 withr_2.5.0
[19] ggfun_0.1.1 xfun_0.39 ellipsis_0.3.2
[22] survival_3.5-5 memoise_2.0.1 xaringanExtra_0.7.0
[25] hexbin_1.28.3 gson_0.1.0 tidytree_0.4.2
[28] GlobalOptions_0.1.2 prettyunits_1.1.1 KEGGREST_1.40.0
[31] promises_1.2.0.1 httr_1.4.6 downloader_0.4
[34] restfulr_0.0.15 ps_1.7.5 rstudioapi_0.14
[37] shinyAce_0.4.2 DOSE_3.26.1 miniUI_0.1.1.1
[40] generics_0.1.3 archive_1.1.5 base64enc_0.1-3
[43] processx_3.8.1 curl_5.0.1 zlibbioc_1.46.0
[46] ggraph_2.1.0 polyclip_1.10-4 ca_0.71.1
[49] GenomeInfoDbData_1.2.10 RBGL_1.76.0 threejs_0.3.3
[52] interactiveDisplayBase_1.38.0 xtable_1.8-4 doParallel_1.0.17
[55] evaluate_0.21 S4Arrays_1.0.4 BiocFileCache_2.8.0
[58] hms_1.1.3 bookdown_0.34 colorspace_2.1-0
[61] filelock_1.0.2 visNetwork_2.1.2 shinyWidgets_0.7.6
[64] magrittr_2.0.3 Rgraphviz_2.44.0 ggtree_3.8.0
[67] later_1.3.1 viridis_0.6.3 lattice_0.21-8
[70] NMF_0.26 genefilter_1.82.1 shadowtext_0.1.2
[73] XML_3.99-0.14 cowplot_1.1.1 pillar_1.9.0
[76] nlme_3.1-162 iterators_1.0.14 gridBase_0.4-7
[79] compiler_4.3.0 stringi_1.7.12 shinycssloaders_1.0.0
[82] Category_2.66.0 TSP_1.2-4 dendextend_1.17.1
[85] GenomicAlignments_1.36.0 plyr_1.8.8 crayon_1.5.2
[88] BiocIO_1.10.0 gridGraphics_0.5-1 emdbook_1.3.12
[91] locfit_1.5-9.8 graphlayouts_1.0.0 bit_4.0.5
[94] chromote_0.1.1 fastmatch_1.1-3 codetools_0.2-19
[97] crosstalk_1.2.0 bslib_0.5.0 GetoptLong_1.0.5
[100] plotly_4.10.2 mime_0.12 splines_4.3.0
[103] circlize_0.4.15 Rcpp_1.0.10 servr_0.27
[106] dbplyr_2.3.2 tippy_0.1.0 HDO.db_0.99.1
[109] blob_1.2.4 utf8_1.2.3 clue_0.3-64
[112] BiocVersion_3.17.1 AnnotationFilter_1.24.0 fs_1.6.2
[115] backbone_2.1.2 expm_0.999-7 ggplotify_0.1.1
[118] Matrix_1.5-4.1 tzdb_0.4.0 tweenr_2.0.2
[121] pkgconfig_2.0.3 tools_4.3.0 cachem_1.0.8
[124] RSQLite_2.3.1 viridisLite_0.4.2 DBI_1.1.3
[127] numDeriv_2016.8-1.1 fastmap_1.1.1 scales_1.2.1
[130] grid_4.3.0 shinydashboard_0.7.2 Rsamtools_2.16.0
[133] AnnotationHub_3.8.0 sass_0.4.6 patchwork_1.1.2
[136] coda_0.19-4 BiocManager_1.30.21 fontawesome_0.5.1
[139] farver_2.1.1 scatterpie_0.2.1 tidygraph_1.2.3
[142] mgcv_1.8-42 yaml_2.3.7 renderthis_0.2.0
[145] AnnotationForge_1.42.2 rtracklayer_1.60.0 cli_3.6.1
[148] webshot_0.5.5 lifecycle_1.0.3 rsconnect_0.8.29
[151] mvtnorm_1.2-2 xaringan_0.28 tximport_1.28.0
[154] rintrojs_0.3.2 BiocParallel_1.34.2 annotate_1.78.0
[157] timechange_0.2.0 gtable_0.3.3 rjson_0.2.21
[160] ggridges_0.5.4 ape_5.7-1 parallel_4.3.0
[163] jsonlite_1.8.5 colourpicker_1.2.0 seriation_1.4.2
[166] bitops_1.0-7 bigmemory.sri_0.1.6 bit64_4.0.5
[169] assertthat_0.2.1 yulab.utils_0.0.6 heatmaply_1.4.2
[172] bs4Dash_2.3.0 bdsmatrix_1.3-6 icons_0.2.0
[175] jquerylib_0.1.4 highr_0.10 GOSemSim_2.26.0
[178] shinyjs_2.1.0 lazyeval_0.2.2 shiny_1.7.4
[181] dynamicTreeCut_1.63-1 enrichplot_1.20.0 htmltools_0.5.5
[184] rappdirs_0.3.3 formatR_1.14 ensembldb_2.24.0
[187] glue_1.6.2 XVector_0.40.0 RCurl_1.98-1.12
[190] treeio_1.24.1 ComplexUpset_1.3.3 gridExtra_2.3
[193] igraph_1.5.0 R6_2.5.1 labeling_0.4.2
[196] cluster_2.1.4 rngtools_1.5.2 bbmle_1.0.25
[199] aplot_0.1.10 DelayedArray_0.26.3 tidyselect_1.2.0
[202] vipor_0.4.5 ProtGenerics_1.32.0 GOstats_2.66.0
[205] ggforce_0.4.1 xml2_1.3.4 emo_0.0.0.9000
[208] munsell_0.5.0 data.table_1.14.8 websocket_1.4.1
[211] fgsea_1.26.0 htmlwidgets_1.6.2 ComplexHeatmap_2.16.0
[214] RColorBrewer_1.1-3 biomaRt_2.56.1 rlang_1.1.1
[217] uuid_1.1-0 fansi_1.0.4